Universal Dependencies for Finnish
Abstract
There has been substantial recent interest in annotation schemes that can be applied consistently to many languages. Building on several recent efforts to unify morphological and syntactic annotation, the Universal Dependencies (UD) project seeks to introduce a cross-linguistically applicable part-of-speech tagset, feature inventory, and set of dependency relations as well as a large number of uniformly annotated treebanks. We present Universal Dependencies for Finnish, one of the ten languages in the recent first release of UD project treebank data. We detail the mapping of previously introduced annotation to the UD standard, describing specific challenges and their resolution. We additionally present parsing experiments comparing the performance of a state-of-the-art parser trained on a language-specific annotation schema to performance on the corresponding UD annotation. The results show improvement compared to the source annotation, indicating that the conversion is accurate and supporting the feasibility of UD as a parsing target. The introduced tools and resources are available under open licenses from http://bionlp.utu.fi/ud-finnish.html.
Similar resources
An annotation scheme for Persian based on Autonomous Phrases Theory and Universal Dependencies
A treebank is a corpus with linguistic annotations above the level of the parts of speech. During the first half of the present decade, three treebanks have been developed for Persian either originally or subsequently based on dependency grammar: Persian Treebank (PerTreeBank), Persian Syntactic Dependency Treebank, and Uppsala Persian Dependency Treebank (UPDT). The syntactic analysis of a sen...
Automatic Morpheme Segmentation and Labeling in Universal Dependencies Resources
Newer incarnations of the Universal Dependencies (UD) resources feature rich morphological annotation on the word-token level as regards tense, mood, aspect, case, gender, and other grammatical information. This information, however, is not aligned to any part of the word forms in the data. In this work, we present an algorithm for inferring this latent alignment between morphosyntactic labels a...
Towards Universal Web Parsebanks
Recently, there has been great interest both in the development of cross-linguistically applicable annotation schemes and in the application of syntactic parsers at web scale to create parsebanks of online texts. The combination of these two trends to create massive, consistently annotated parsebanks in many languages holds enormous potential for the quantitative study of many linguistic phenom...
Assessing the Annotation Consistency of the Universal Dependencies Corpora
A fundamental issue in annotation efforts is to ensure that the same phenomena within and across corpora are annotated consistently. To date, there has not been a clear and obvious way to ensure annotation consistency of dependency corpora. Here, we revisit the method of Boyd et al. (2008) to flag inconsistencies in dependency corpora, and evaluate it on three languages with varying degrees of ...
Predicting Conjunct Propagation and Other Extended Stanford Dependencies
In this work, we present a data-driven method to enhance syntax trees with additional dependencies as defined in the well-known Stanford Dependencies scheme, so as to give more information about the structure of the sentence. This hybrid method utilizes both machine learning and a rule-based approach, and achieves a performance of 93.1% in F1-score, as evaluated using an existing treebank of Fin...